Abstract

Semisupervised learning has emerged as a popular framework for improving modeling accuracy while
controlling labeling cost. Based on an extension of stochastic composite likelihood we quantify the
asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free
analysis by providing an alternative framework to measure the value associated with different labeling
policies and resolve the fundamental question of how much data to label and in what manner. We
demonstrate our approach with both simulation studies and real world experiments using naive Bayes
for text classification and MRFs and CRFs for structured prediction in NLP.
1 Introduction

Semisupervised learning (SSL) is a technique for estimating statistical models using both labeled and
unlabeled data. It is particularly useful when the costs of obtaining labeled and unlabeled samples are
different. In particular, assuming that unlabeled data is more easily available, SSL provides improved
modeling accuracy by adding a large number of unlabeled samples to a relatively small labeled dataset.

The practical value of SSL has motivated several attempts to mathematically quantify its value beyond
traditional supervised techniques. Of particular importance is the dependency of that improvement on the
amount of unlabeled and labeled data. In the case of structured prediction the accuracy of the SSL estimator
also depends on the specific manner in which sequences are labeled. Focusing on the framework of generative
or likelihood-based SSL applied to classification and structured prediction, we identify the following
questions, which we address in this paper.

Q1: Consistency (classification). What combinations of labeled and unlabeled data lead to precise models
in the limit of large data?

Q2: Accuracy (classification). How can we quantitatively express the estimation accuracy for a particular
generative model as a function of the amount of labeled and unlabeled data? What is the improvement in
estimation accuracy resulting from replacing an unlabeled example with a labeled one?

Q3: Consistency (structured prediction). What strategies for sequence labeling lead to precise models in the
limit of large data?

Q4: Accuracy (structured prediction). How can we quantitatively express the estimation quality for a
particular model and structured labeling strategy? What is the improvement in estimation accuracy resulting
from replacing one labeling strategy with another?

Q5: Tradeoff (classification and structured prediction). How can we quantitatively express the tradeoff
between the two competing goals of improved prediction accuracy and low labeling cost? What are the
possible ways to resolve that tradeoff optimally within a problem-specific context?

Q6: Practical Algorithms. How can we determine how much data to label in practical settings?

The first five questions are of fundamental importance to SSL theory. Recent related work has concentrated
on large deviation bounds for discriminative SSL as a response to Q1 and Q2 above. While enjoying
broad applicability, such non-parametric bounds are weakened when the model family's worst case is
atypical. By forgoing finite sample analysis, our approach complements these efforts and provides insights
which apply to the specific generative models under consideration. In presenting answers to the last
question, we reveal the relative merits of asymptotic analysis and how its use, perhaps surprisingly, yields
practical heuristics for controlling labeling cost.

Our asymptotic derivations are made possible by extending the recently proposed stochastic composite
likelihood formalism [5] and showing that generative SSL is a special case of that extension. The implications
of this analysis are demonstrated using a simulation study as well as text classification and NLP structured
prediction experiments. The developed framework, however, is general enough to apply to any generative
SSL problem. As in [7], the delta method transforms our results from parameter asymptotics to prediction
risk asymptotics. We omit these results for lack of space.

2 Related Work 

Semisupervised learning has received much attention in the past decade. Perhaps the first study in this 
area was done by Castelli and Cover [3] who examined the convergence of the classification error rate as a 
labeled example is added to an unlabeled dataset drawn from a Gaussian mixture model. Nigam et al. [9] 
proposed a practical SSL framework based on maximizing the likelihood of the observed data. An edited 
volume describing more recent developments is [1]. 

The goal of theoretically quantifying the effect of SSL has recently gained increased attention. Sinha
and Belkin [11] examined the effect of using unlabeled samples with imperfect models for mixture models.
Balcan and Blum [1] and Singh et al. [10] analyze discriminative SSL using PAC theory and large deviation
bounds. Additional analysis has been conducted under specific distributional assumptions such as the
"cluster assumption", "smoothness assumption" and the "low density assumption" [3]. However, many of
these assumptions are criticized in [2].

Our work complements the above studies in that we focus on generative as opposed to discriminative SSL. 
In contrast to most other studies, we derive model specific asymptotics as opposed to non-parametric large 
deviation bounds. While such bounds are helpful as they apply to a broad set of cases, they also provide 
less information than model-based analysis due to their generality. Our analysis, on the other hand, requires 
knowledge of the specific model family and an estimate of the model parameter. The resulting asymptotics, 
however, apply specifically to the case at hand without the need of potentially loose bounds. 

We believe that our work is the first to consider and answer questions Q1-Q6 in the context of generative 
SSL. In particular, our work provides a new framework for examining the accuracy-cost SSL tradeoff in a 
way that is quantitative, practical, and model-specific. 

3 Stochastic SSL Estimators 

Generative SSL [9j H] estimates a parametric model by maximizing the observed likelihood incorporating L 
labeled and U unlabeled examples 

$$\ell(\theta) = \sum_{i=1}^{L} \log p_\theta(X^{(i)}, Y^{(i)}) + \sum_{i=L+1}^{L+U} \log p_\theta(X^{(i)}) \tag{1}$$

where $p_\theta(X^{(i)})$ above is obtained by marginalizing the latent label, $p_\theta(X^{(i)}) = \sum_y p_\theta(X^{(i)}, y)$. A classical example is
the naive Bayes model in [3] where $p_\theta(X,Y) = p_\theta(X|Y)\,p(Y)$, $p_\theta(X|Y=y) = \mathrm{Mult}([\theta_y]_1, \ldots, [\theta_y]_V)$. The
framework, however, is general enough to apply to any generative model $p_\theta(X,Y)$.
To analyze the asymptotic behavior of the maximizer of (1) we assume that the fraction of labeled
examples $\lambda = L/(L+U)$ is kept constant while $n = L + U \to \infty$. More generally, we assume a
stochastic version of (1) where each one of the $n$ samples $X^{(1)}, \ldots, X^{(n)}$ is labeled with probability $\lambda$

$$\ell_n(\theta) = \sum_{i=1}^{n} Z^{(i)} \log p_\theta(X^{(i)}, Y^{(i)}) + \sum_{i=1}^{n} (1 - Z^{(i)}) \log p_\theta(X^{(i)}), \qquad Z^{(i)} \sim \mathrm{Bin}(1, \lambda). \tag{2}$$

The variable $Z^{(i)}$ above is an indicator taking the value 1 with probability $\lambda$ and 0 otherwise. Due to the law
of large numbers, for large $n$ we will have approximately $L = n\lambda$ labeled samples and $U = n(1-\lambda)$ unlabeled
samples, thus achieving the asymptotic behavior of (1).
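
As a rough illustration of (2), the sketch below evaluates the stochastic SSL log-likelihood for a toy model that is an assumption made here for concreteness and does not appear in the experiments: $Y \sim \mathrm{Bernoulli}(\text{prior})$ and a single binary feature $X | Y=y \sim \mathrm{Bernoulli}(\theta_y)$. The function names `ssl_log_likelihood` and `simulate` are hypothetical.

```python
import math
import random

def ssl_log_likelihood(theta, prior, xs, ys, zs):
    """Evaluate (2) for the toy model Y ~ Bernoulli(prior), X | Y=y ~ Bernoulli(theta[y])."""
    def joint(x, y):
        p_y = prior if y == 1 else 1.0 - prior
        p_x = theta[y] if x == 1 else 1.0 - theta[y]
        return p_y * p_x

    total = 0.0
    for x, y, z in zip(xs, ys, zs):
        if z == 1:
            # Labeled sample: joint log-likelihood log p(x, y).
            total += math.log(joint(x, y))
        else:
            # Unlabeled sample: marginalize the latent label.
            total += math.log(joint(x, 0) + joint(x, 1))
    return total

def simulate(n, lam, theta, prior, seed=0):
    """Draw (x, y) pairs and independent labeling indicators Z ~ Bin(1, lam)."""
    rng = random.Random(seed)
    ys = [1 if rng.random() < prior else 0 for _ in range(n)]
    xs = [1 if rng.random() < theta[y] else 0 for y in ys]
    zs = [1 if rng.random() < lam else 0 for _ in range(n)]
    return xs, ys, zs

xs, ys, zs = simulate(n=10000, lam=0.3, theta=(0.2, 0.7), prior=0.4)
labeled_fraction = sum(zs) / len(zs)   # close to lam for large n
log_lik = ssl_log_likelihood((0.2, 0.7), 0.4, xs, ys, zs)
```

The labeled fraction concentrates around $\lambda$ as $n$ grows, which is the law of large numbers argument used above.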

Equation (2) is sufficient to handle the case of classification. However, in the case of structured prediction
we may have sequences $X^{(i)}, Y^{(i)}$, where for each $i$ some components of the label sequence are missing and
some are observed. For example one label sequence may be completely observed, another may be completely
unobserved, and a third may have the first half labeled and the second half not.

More formally, we assume the existence of a sequence labeling policy or strategy $\rho$ which maps label
sequences $Y^{(i)} = (Y^{(i)}_1, \ldots, Y^{(i)}_m)$ to a subset corresponding to the observed labels $\rho(Y^{(i)}) \subset \{Y^{(i)}_1, \ldots, Y^{(i)}_m\}$.
To achieve full generality we allow the labeling policy $\rho$ to be stochastic, leading to different subsets of
$\{Y^{(i)}_1, \ldots, Y^{(i)}_m\}$ with different probabilities. A simple "all or nothing" labeling policy could label the entire
sequence with probability $\lambda$ and otherwise ignore it. Another policy may label the entire sequence, the first
half, or ignore it completely with equal probabilities

$$\rho(Y^{(i)}) = \begin{cases} \{Y^{(i)}_1, \ldots, Y^{(i)}_m\} & \text{with probability } 1/3 \\ \emptyset & \text{with probability } 1/3 \\ \{Y^{(i)}_1, \ldots, Y^{(i)}_{\lfloor m/2 \rfloor}\} & \text{with probability } 1/3. \end{cases} \tag{3}$$

We thus have the following generalization of (2) for structured prediction

$$\ell_n(\theta) = \sum_{i=1}^{n} \log p_\theta(\rho(Y^{(i)}), X^{(i)}). \tag{4}$$

Equation (4) generalizes standard SSL from all or nothing labeling to arbitrary labeling policies. The
fundamental SSL question in this case is not simply what is the dependency of the estimation accuracy on
$n$ and $\lambda$. Rather we ask what is the dependency of the estimation accuracy on the labeling policy $\rho$. Of
particular interest is the question of which labeling policies $\rho$ achieve high estimation accuracy coupled with
low labeling cost. Answering these questions leads to a generative SSL theory that quantitatively balances
estimation accuracy and labeling cost.
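
To make the notion of a stochastic labeling policy concrete, the following minimal sketch samples the set of observed label positions for a length-$m$ sequence under the "all or nothing" policy and under policy (3). The function names are hypothetical, positions are 0-based indices, and the labeling cost is taken to be the number of observed labels, which is the cost model assumed later in the paper.

```python
import random

def all_or_nothing(m, lam, rng):
    """Label the whole length-m sequence with probability lam, otherwise nothing."""
    return list(range(m)) if rng.random() < lam else []

def policy_three_way(m, rng):
    """Policy (3): full sequence, nothing, or first half, each with probability 1/3."""
    u = rng.random()
    if u < 1.0 / 3.0:
        return list(range(m))        # all labels observed
    if u < 2.0 / 3.0:
        return []                    # no labels observed
    return list(range(m // 2))       # first floor(m/2) labels observed

def expected_cost_three_way(m):
    """Expected number of observed labels under policy (3)."""
    return (m + 0 + m // 2) / 3.0

rng = random.Random(0)
draws = [policy_three_way(10, rng) for _ in range(30000)]
empirical_cost = sum(len(d) for d in draws) / len(draws)
```

For a sequence of length 10 the expected cost under (3) is 5 labels, and the empirical average over many draws should be close to that value.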

Finally, we note that both (2) and (4) are random variables whose outcomes depend on the random
variables $Z^{(1)}, \ldots, Z^{(n)}$ (for (2)) or $\rho$ (for (4)). Consequently, the analysis of the maximizer $\hat\theta_n$ of (2) or
(4) needs to be done in a probabilistic manner.

4 A1: Consistency (Classification)

Assuming that the data is generated from $p_{\theta_0}(X, Y)$, consistency corresponds to the convergence of

$$\hat\theta_n = \arg\max_{\theta} \ell_n(\theta) \tag{5}$$

to $\theta_0$ with probability 1 as $n \to \infty$ ($\ell_n$ is defined in (2)). This implies that in the limit of large data our
estimator would converge to the truth. Note that large data $n \to \infty$ in this case means that both labeled
and unlabeled data increase to $\infty$ (but their relative sizes remain fixed by the constant $\lambda$).
We show in this section that the maximizer of (2) is consistent assuming that $\lambda > 0$. This is not an
unexpected conclusion but for the sake of completeness we prove it here rigorously. The proof technique will
also be used later when we discuss consistency of SSL estimators for structured prediction.

The central idea in the proof is to cast the generative SSL estimation problem as an extension of stochastic
composite likelihood [5]. Our proof follows similar lines to the consistency proof of [5] with the exception
that it does not assume independence of the indicator functions $Z^{(i)}$ and $(1 - Z^{(i)})$ as is assumed there.

Definition 1. A distribution $p_\theta(X, Y)$ is said to be identifiable if $\theta \neq \eta$ entails that $p_\theta(X, Y) - p_\eta(X, Y)$ is
not identically zero.

Proposition 1. Let $\Theta \subset \mathbb{R}^r$ be a compact set, and $p_\theta(x,y) > 0$ be identifiable and smooth in $\theta$. Then if
$\lambda > 0$ the maximizer $\hat\theta_n$ of (2) is consistent, i.e., $\hat\theta_n \to \theta_0$ as $n \to \infty$ with probability 1.

Proof. The likelihood function, modified slightly by a linear combination with a constant, is

$$\ell'_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left( Z^{(i)} \log p_\theta(X^{(i)}, Y^{(i)}) - \lambda \log p_{\theta_0}(X^{(i)}, Y^{(i)}) \right) + \frac{1}{n}\sum_{i=1}^{n} \left( (1 - Z^{(i)}) \log p_\theta(X^{(i)}) - (1-\lambda) \log p_{\theta_0}(X^{(i)}) \right),$$

which converges by the strong law of large numbers as $n \to \infty$ to its expectation with probability 1

$$\mu(\theta) = -\lambda D(p_{\theta_0}(X,Y) \,\|\, p_\theta(X,Y)) - (1-\lambda) D(p_{\theta_0}(X) \,\|\, p_\theta(X)).$$

If we restrict ourselves to the compact set $S = \{\theta : c_1 \le \|\theta - \theta_0\| \le c_2\}$ then $|\log p_\theta(X, Y)| < K(X, Y) < \infty$,
$\forall \theta \in S$. As a result, the conditions for the uniform strong law of large numbers, cf. chapter 16 of [6],
hold on $S$ leading to

$$P\left\{ \lim_{n \to \infty} \sup_{\theta \in S} |\ell'_n(\theta) - \mu(\theta)| = 0 \right\} = 1. \tag{6}$$

Due to the identifiability of $p_\theta(X,Y)$ we have $D(p_{\theta_0}(X,Y) \,\|\, p_\theta(X,Y)) \ge 0$ with equality iff $\theta = \theta_0$. Since
also $D(p_{\theta_0}(X) \,\|\, p_\theta(X)) \ge 0$ we have that $\mu(\theta) \le 0$ with equality iff $\theta = \theta_0$ (assuming $\lambda > 0$). Furthermore,
since the function $\mu(\theta)$ is continuous it attains its negative supremum on the compact $S$: $\sup_{\theta \in S} \mu(\theta) < 0$.

Combining this fact with (6) we have that there exists $N$ such that for all $n > N$ the likelihood maximizers
on $S$ achieve strictly negative values of $\ell'_n(\theta)$ with probability 1. However, since $\ell'_n(\theta)$ can be made to
achieve values arbitrarily close to zero under $\theta = \theta_0$, we have that $\hat\theta_n \notin S$ for $n > N$. Since $c_1, c_2$ were chosen
arbitrarily, $\hat\theta_n \to \theta_0$ with probability 1. □

The above proposition is not surprising. As $n \to \infty$ the number of labeled examples increases to $\infty$
and thus it remains to ensure that adding an increasing number of unlabeled examples does not hurt the
estimator. More interesting is the quantitative description of the accuracy of $\hat\theta_n$ and its dependency on
$\theta_0, \lambda, n$, which we turn to next.
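
The consistency statement can be checked numerically on a small example. The sketch below is an illustration under assumptions made here, not one of the paper's experiments: the model has a scalar parameter $\theta = P(Y=1)$, the class-conditionals $X | Y=y \sim \mathrm{Bernoulli}(q_y)$ are treated as known, and the maximizer of (2) is found by a simple grid search rather than by EM. The names `fit_class_prior` and `simulate` are hypothetical.

```python
import math
import random

def fit_class_prior(xs, ys, zs, q0, q1, grid_size=999):
    """Grid-search maximizer of (2) for a toy model with scalar parameter theta.

    Toy model (an assumption for illustration): Y ~ Bernoulli(theta) and
    X | Y=y ~ Bernoulli(q_y) with q0, q1 known, so only theta is estimated.
    """
    # Labeled samples depend on theta through y only (log p(x|y) is constant
    # in theta and is dropped); unlabeled samples depend on theta through x.
    lab1 = sum(1 for y, z in zip(ys, zs) if z == 1 and y == 1)
    lab0 = sum(1 for y, z in zip(ys, zs) if z == 1 and y == 0)
    unl1 = sum(1 for x, z in zip(xs, zs) if z == 0 and x == 1)
    unl0 = sum(1 for x, z in zip(xs, zs) if z == 0 and x == 0)

    def log_lik(theta):
        m = theta * q1 + (1.0 - theta) * q0   # marginal P(X = 1)
        return (lab1 * math.log(theta) + lab0 * math.log(1.0 - theta)
                + unl1 * math.log(m) + unl0 * math.log(1.0 - m))

    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=log_lik)

def simulate(n, lam, theta0, q0, q1, seed):
    """Draw n samples from the toy model with labeling probability lam."""
    rng = random.Random(seed)
    ys = [1 if rng.random() < theta0 else 0 for _ in range(n)]
    xs = [1 if rng.random() < (q1 if y == 1 else q0) else 0 for y in ys]
    zs = [1 if rng.random() < lam else 0 for _ in range(n)]
    return xs, ys, zs

theta0, lam, q0, q1 = 0.3, 0.2, 0.2, 0.7
estimates = {}
for n in (100, 1000, 20000):
    xs, ys, zs = simulate(n, lam, theta0, q0, q1, seed=n)
    estimates[n] = fit_class_prior(xs, ys, zs, q0, q1)
```

With $\lambda > 0$ held fixed, the estimates in `estimates` should move toward $\theta_0$ as $n$ increases, up to sampling noise and the grid resolution.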



5 A2: Accuracy (Classification) 

The proposition below states that the distribution of the maximizer of (2) is asymptotically normal and
provides its variance, which may be used to characterize the accuracy of $\hat\theta_n$ as a function of $n, \theta_0, \lambda$. As in
Section 4 our proof proceeds by casting generative SSL as an extension of stochastic composite likelihood.

In Proposition 2 (below) and in Proposition 4 we use $\mathrm{Var}_{\theta_0}(H)$ to denote the variance matrix of a random
vector $H$ under $p_{\theta_0}$. The notations $\xrightarrow{p}$, $\rightsquigarrow$ denote convergences in probability and in distribution [6] and
$\nabla f(\theta)$, $\nabla^2 f(\theta)$ are the $r \times 1$ gradient vector and $r \times r$ matrix of second order derivatives of $f(\theta)$.

Proposition 2. Under the assumptions of Proposition 1 as well as convexity of $\Theta$ we have the following
convergence in distribution of the maximizer of (2)

$$\sqrt{n}(\hat\theta_n - \theta_0) \rightsquigarrow N(0, \Sigma^{-1}) \tag{7}$$

as $n \to \infty$, where

$$\Sigma = \lambda \mathrm{Var}_{\theta_0}(V_1) + (1-\lambda) \mathrm{Var}_{\theta_0}(V_2)$$
$$V_1 = \nabla_\theta \log p_{\theta_0}(X,Y), \qquad V_2 = \nabla_\theta \log p_{\theta_0}(X).$$

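Proof. By the mean value theorem and convexity of $\Theta$, there is $\eta \in (0, 1)$ for which $\theta' = \theta_0 + \eta(\hat\theta_n - \theta_0)$ and

$$\nabla \ell_n(\hat\theta_n) = \nabla \ell_n(\theta_0) + \nabla^2 \ell_n(\theta')(\hat\theta_n - \theta_0).$$

(Here $\ell_n$ is understood as (2) scaled by $1/n$, which leaves the maximizer unchanged.) Since $\hat\theta_n$ maximizes $\ell_n$ we have $\nabla \ell_n(\hat\theta_n) = 0$ and

$$\sqrt{n}(\hat\theta_n - \theta_0) = -\sqrt{n}\, (\nabla^2 \ell_n(\theta'))^{-1} \nabla \ell_n(\theta_0). \tag{8}$$

By Proposition 1 we have $\hat\theta_n \xrightarrow{p} \theta_0$ which implies that $\theta' \xrightarrow{p} \theta_0$ as well. Furthermore, by the law of large
numbers and the fact that $W_n \xrightarrow{p} W$ implies $g(W_n) \xrightarrow{p} g(W)$ for continuous $g$,

$$(\nabla^2 \ell_n(\theta'))^{-1} \xrightarrow{p} (\nabla^2 \ell_n(\theta_0))^{-1} \xrightarrow{p} \left( \lambda E_{\theta_0} \nabla^2 \log p_{\theta_0}(X,Y) + (1-\lambda) E_{\theta_0} \nabla^2 \log p_{\theta_0}(X) \right)^{-1} = -\Sigma^{-1} \tag{9}$$

where in the last equality we used a well known identity concerning the Fisher information.
For the remaining term in the rhs of (8) we have

$$\sqrt{n}\, \nabla \ell_n(\theta_0) = \sqrt{n}\, \frac{1}{n} \sum_{i=1}^{n} (W^{(i)} + Q^{(i)}) \tag{10}$$

where $W^{(i)} = Z^{(i)} \nabla \log p_{\theta_0}(X^{(i)}, Y^{(i)})$, $Q^{(i)} = (1 - Z^{(i)}) \nabla \log p_{\theta_0}(X^{(i)})$. Since (10) is an average of iid
random vectors $W^{(i)} + Q^{(i)}$ it is asymptotically normal by the central limit theorem with mean

$$E_{\theta_0}(Q + W) = \lambda E_{\theta_0} \nabla \log p_{\theta_0}(X, Y) + (1 - \lambda) E_{\theta_0} \nabla \log p_{\theta_0}(X) = \lambda 0 + (1 - \lambda) 0 = 0$$

and variance

$$\mathrm{Var}_{\theta_0}(W + Q) = E_{\theta_0} W W^\top + E_{\theta_0} Q Q^\top + E_{\theta_0} W Q^\top + E_{\theta_0} Q W^\top = \lambda \mathrm{Var}_{\theta_0} V_1 + (1-\lambda) \mathrm{Var}_{\theta_0} V_2$$

where we used $E Z^2 = \lambda$, $E(1-Z)^2 = 1 - \lambda$, and $E(Z(1 - Z)) = EZ - EZ^2 = 0$, so that the cross terms vanish.
We have thus established that

$$\sqrt{n}\, \nabla \ell_n(\theta_0) \rightsquigarrow N(0, \Sigma). \tag{11}$$

We finish the proof by combining (8), (9) and (11) using Slutsky's theorem. □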

Proposition 2 characterizes the asymptotic estimation accuracy using the matrix $\Sigma$. Two convenient
one dimensional summaries of the accuracy are the trace and the determinant of $\Sigma$. In some simple cases
(such as binary event naive Bayes) $\mathrm{tr}(\Sigma)$ can be brought to a mathematically simple form which exposes its
dependency on $\theta_0, n, \lambda$. In other cases the dependency may be obtained using numerical computing.
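
As an example of obtaining the dependency numerically, the sketch below computes $\Sigma = \lambda \mathrm{Var}_{\theta_0}(V_1) + (1-\lambda) \mathrm{Var}_{\theta_0}(V_2)$ exactly by enumeration and reports $\mathrm{tr}(\Sigma^{-1})$, the trace of the asymptotic variance in (7). The model is an assumption chosen for brevity: a known class prior, one binary feature, and parameters $\theta = (\theta_0, \theta_1)$ with $X | Y=y \sim \mathrm{Bernoulli}(\theta_y)$. The function names are hypothetical.

```python
def fisher_matrices(theta, prior):
    """Return (Var(V1), Var(V2)) for the toy model, computed by enumeration.

    V1 is the score of the joint p(x, y), V2 the score of the marginal p(x).
    Both scores have mean zero, so each variance equals E[V V^T].
    """
    p_y = (1.0 - prior, prior)

    def p_x_given_y(x, y):
        return theta[y] if x == 1 else 1.0 - theta[y]

    var_v1 = [[0.0, 0.0], [0.0, 0.0]]
    var_v2 = [[0.0, 0.0], [0.0, 0.0]]
    for x in (0, 1):
        sign = 1.0 if x == 1 else -1.0
        p_x = sum(p_y[y] * p_x_given_y(x, y) for y in (0, 1))
        # d/d theta_k log p(x) = p(y=k) * sign / p(x)
        v2 = [p_y[k] * sign / p_x for k in (0, 1)]
        for a in (0, 1):
            for b in (0, 1):
                var_v2[a][b] += p_x * v2[a] * v2[b]
        for y in (0, 1):
            # d/d theta_y log p(x, y) = sign / p(x | y); other coordinate is 0
            v1 = [0.0, 0.0]
            v1[y] = sign / p_x_given_y(x, y)
            p_xy = p_y[y] * p_x_given_y(x, y)
            for a in (0, 1):
                for b in (0, 1):
                    var_v1[a][b] += p_xy * v1[a] * v1[b]
    return var_v1, var_v2

def trace_inverse_sigma(lam, theta, prior):
    """tr(Sigma^{-1}) with Sigma = lam * Var(V1) + (1 - lam) * Var(V2), lam > 0."""
    var_v1, var_v2 = fisher_matrices(theta, prior)
    s = [[lam * var_v1[a][b] + (1.0 - lam) * var_v2[a][b] for b in (0, 1)]
         for a in (0, 1)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    # For a 2x2 matrix, tr(S^{-1}) = (s00 + s11) / det(S).
    return (s[0][0] + s[1][1]) / det

traces = {lam: trace_inverse_sigma(lam, (0.2, 0.7), 0.4) for lam in (0.2, 0.5, 1.0)}
```

In this toy model the marginal $p_\theta(X)$ alone does not identify both parameters, so $\mathrm{Var}(V_2)$ is singular and $\Sigma$ is invertible only because $\lambda > 0$; the trace of $\Sigma^{-1}$ shrinks as $\lambda$ grows.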

Figure 1 displays three error measures for the multinomial naive Bayes SSL classifier [9] and the Reuters
RCV1 text classification data. In all three figures the error measures are represented as functions of $n$
(horizontal axis) and $\lambda$ (vertical axis). The error measures are classification error rate (left), trace of the
empirical mse (middle), and log-trace of the asymptotic variance (right). The measures were obtained over
held-out sets and averaged using cross validation. Figure 3 (middle) displays the asymptotic variance as a
function of $n$ and $\lambda$ for a randomly drawn $\theta_0$.

As expected the measures decrease with $n$ and $\lambda$ in all the figures. It is interesting to note, however,
that the shapes of the contour plots are very similar across the three different measures (top row). This
confirms that the asymptotic variance (right) is a valid proxy for the finite sample measures of error rates
and empirical mse. We thus conclude that the asymptotic variance is an attractive measure that is similar
to finite sample error rate and at the same time has a convenient mathematical expression.
6 A3: Consistency (Structured)

In the case of structured prediction the log-likelihood (4) is specified using a stochastic labeling policy. In
this section we consider the conditions on that policy that ensure estimation consistency, or in other words
convergence of the maximizer of (4) to $\theta_0$ as $n \to \infty$.

We assume that the labeling policy $\rho$ is a probabilistic mixture of deterministic sequence labeling functions
$\chi_1, \ldots, \chi_k$. In other words, $\rho(Y)$ takes values $\chi_i(Y)$, $i = 1, \ldots, k$ with probabilities $\lambda_1, \ldots, \lambda_k$. For example
the policy (3) corresponds to $\chi_1(Y) = Y$, $\chi_2(Y) = \emptyset$, $\chi_3(Y) = \{Y_1, \ldots, Y_{\lfloor m/2 \rfloor}\}$ (where $Y = \{Y_1, \ldots, Y_m\}$)
and $\lambda = (1/3, 1/3, 1/3)$.

Using the above notation we can write (4) as

$$\ell_n(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} Z^{(i)}_j \log p_\theta(\chi_j(Y^{(i)}), X^{(i)}) \tag{12}$$

$$Z^{(i)} \sim \mathrm{Mult}(1, (\lambda_1, \ldots, \lambda_k))$$

which exposes its similarity to the stochastic composite likelihood function in [5]. Note however that (12) is
not formally a stochastic composite likelihood since $Z^{(i)}_j$, $j = 1, \ldots, k$ are not independent and since $\chi_j(Y)$
depends on the length of the sequence $Y$ (see for example $\chi_1$ and $\chi_3$ above). We also use the notation $S^m_j$
for the subset of labels provided by $\chi_j$ on length-$m$ sequences

$$\chi_j(Y_1, \ldots, Y_m) = \{Y_i : i \in S^m_j\}.$$

Definition 2. A labeling policy is said to be identifiable if the following map is injective

$$\bigcup_{m : q(m) > 0} \; \bigcup_{j=1}^{k} \left\{ p_\theta(\{Y_r : r \in S^m_j\}, X) \right\} \mapsto p_\theta(X, Y)$$

where $q$ is the distribution of sequence lengths. In other words, there is at most one collection of probabilities
corresponding to the lhs above that does not contradict the joint distribution.

The importance of Definition 2 is that it ensures the recovery of $\theta_0$ from the sequences partially labeled
using the labeling policy. For example, a labeling policy characterized by $\chi_1(Y) = Y_1$, $\lambda_1 = 1$ (always label
only the first sequence element) is non-identifiable for most interesting $p_\theta$ as the first sequence component
is unlikely to provide sufficient information to characterize the parameters associated with transitions
$Y_t \to Y_{t+1}$.

Proposition 3. Assuming the same conditions as Proposition 1 and $\lambda_1, \ldots, \lambda_k > 0$ with identifiable
$\chi_1, \ldots, \chi_k$, the maximizer of (12) is consistent, i.e., $\hat\theta_n \to \theta_0$ as $n \to \infty$ with probability 1.

Proof. The log-likelihood modified slightly by a linear combination with a constant is

$$\ell'_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \left( Z^{(i)}_j \log p_\theta(\chi_j(Y^{(i)}), X^{(i)}) - \lambda_j \log p_{\theta_0}(\chi_j(Y^{(i)}), X^{(i)}) \right).$$

By the strong law of large numbers $\ell'_n(\theta)$ converges to its expectation

$$\mu(\theta) = -\sum_{j=1}^{k} \lambda_j \sum_{m > 0} q(m)\, D\left( p_{\theta_0}(\{Y_i : i \in S^m_j\}, X) \,\|\, p_\theta(\{Y_i : i \in S^m_j\}, X) \right).$$

Since $\mu$ is the negative of a linear combination of KL divergences with positive weights it is non-positive and is 0 if
$\theta = \theta_0$. The identifiability of the labeling policy ensures that $\mu(\theta) < 0$ if $\theta \neq \theta_0$. We have thus established
that $\ell'_n(\theta)$ converges to a non-positive continuous function $\mu(\theta)$ whose maximum is achieved at $\theta_0$. The rest
of the proof proceeds along similar lines as Proposition 1. □
Ultimately, the precise conditions for consistency will depend on the parametric family $p_\theta$ under
consideration. For many structured prediction models such as Markov random fields the consistency conditions
are mild. Depending on the precise feature functions, consistency is generally satisfied for every policy that
labels contiguous subsequences with positive probability. However, some care needs to be applied for models
like HMM containing parameters associated with the start label or end label and with models asserting
higher order Markov assumptions.



7 A4: Accuracy (Structured) 



We consider in this section the dependency of the estimation accuracy in structured prediction SSL (4) on
$n$, $\theta_0$ but perhaps most interestingly on the labeling policy $\rho$. Doing so provides insight into not only how
much data to label but also in what way.

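Proposition 4. Under the assumptions of Proposition 3 as well as convexity of $\Theta$ we have the following
convergence in distribution of the maximizer of (12)

$$\sqrt{n}(\hat\theta_n - \theta_0) \rightsquigarrow N(0, \Sigma^{-1}) \tag{13}$$

as $n \to \infty$, where

$$\Sigma = \sum_{m > 0} q(m) \sum_{j=1}^{k} \lambda_j \mathrm{Var}_{\theta_0}(\nabla V_{jm})$$

$$V_{jm} = \log p_{\theta_0}(\{Y_i : i \in S^m_j\}, X).$$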
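Proof. By the mean value theorem and convexity of $\Theta$ there is $\eta \in (0, 1)$ for which $\theta' = \theta_0 + \eta(\hat\theta_n - \theta_0)$ and

$$\nabla \ell_n(\hat\theta_n) = \nabla \ell_n(\theta_0) + \nabla^2 \ell_n(\theta')(\hat\theta_n - \theta_0).$$

(Here $\ell_n$ is understood as (12) scaled by $1/n$, which leaves the maximizer unchanged.) Since $\hat\theta_n$ maximizes $\ell_n$, $\nabla \ell_n(\hat\theta_n) = 0$ and

$$\sqrt{n}(\hat\theta_n - \theta_0) = -\sqrt{n}\, (\nabla^2 \ell_n(\theta'))^{-1} \nabla \ell_n(\theta_0). \tag{14}$$

By Proposition 3 we have $\hat\theta_n \xrightarrow{p} \theta_0$ which implies that $\theta' \xrightarrow{p} \theta_0$ as well. Furthermore, by the law of large
numbers and the fact that if $W_n \xrightarrow{p} W$ then $g(W_n) \xrightarrow{p} g(W)$ for continuous $g$,

$$(\nabla^2 \ell_n(\theta'))^{-1} \xrightarrow{p} (\nabla^2 \ell_n(\theta_0))^{-1} \xrightarrow{p} \left( \sum_{m > 0} q(m) \sum_{j=1}^{k} \lambda_j E_{\theta_0}(\nabla^2 V_{jm}) \right)^{-1} = -\left( \sum_{m > 0} q(m) \sum_{j=1}^{k} \lambda_j \mathrm{Var}_{\theta_0}(\nabla V_{jm}) \right)^{-1} \tag{15}$$

where in the last equality we used a well known identity concerning the Fisher information.
For the remaining term on the rhs of (14) we have

$$\sqrt{n}\, \nabla \ell_n(\theta_0) = \sqrt{n}\, \frac{1}{n} \sum_{i=1}^{n} W_i \tag{16}$$

where the random vectors

$$W_i = \sum_{m > 0} 1_{\{\mathrm{length}(Y^{(i)}) = m\}} \sum_{j=1}^{k} Z^{(i)}_j \nabla V_{jm}$$

have expectation 0 due to the fact that the expectation of the score is 0. The variance of $W_i$ is

$$\mathrm{Var}_{\theta_0} W_i = E_{\theta_0} \sum_{m > 0} 1_{\{\mathrm{length}(Y^{(i)}) = m\}} \sum_{j=1}^{k} Z^{(i)}_j \nabla V_{jm} \nabla V_{jm}^\top = \sum_{m > 0} q(m) \sum_{j=1}^{k} \lambda_j E_{\theta_0}(\nabla V_{jm} \nabla V_{jm}^\top)$$

where in the first equality we used the fact that $Y^{(i)}$ can have only one length and only one of $\chi_1, \ldots, \chi_k$ is
chosen. Using the central limit theorem we thus conclude that

$$\sqrt{n}\, \nabla \ell_n(\theta_0) \rightsquigarrow N(0, \Sigma)$$

and finish the proof by combining (14), (15), and (16) using Slutsky's theorem. □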

Figure 2 (left, middle) displays the test-set per-sequence perplexity for the CoNLL2000 chunking task
as a function of the total number of labeled tokens. We used the Boltzmann chain MRF model that is the
MRF corresponding to HMM (though not identical, e.g., [5]). We consider labeling policies $\rho$ that label the
entire sequence with probability $\lambda$ and otherwise label contiguous sequences of length 5 (left) or leave the
sequence fully unlabeled (middle). Lighter nodes indicate larger $n$ and unsurprisingly show a decrease in
the test-set perplexity as $n$ is increased. Interestingly, the middle figure shows that labeling policies using a
smaller amount of labels may outperform other policies. This further motivates our analysis and indicates
that naive choices of $\rho$ may be inefficient, viz. inflating labeling cost with negligible improvement
in accuracy (cf. also Sec. 8 for how to avoid this pitfall).

7.1 Conditional Structured Prediction 

Thus far our discussion on structured prediction has been restricted to generative models such as HMM or
Boltzmann chain MRF. Similar techniques, however, can be used to analyze SSL for conditional models such
as CRFs that are estimated by maximizing the conditional likelihood. The key to extending the results in
this paper to CRFs is to express conditional SSL estimation in a form similar to (4)

$$\hat\theta_n = \arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(\rho(Y^{(i)}) \,|\, X^{(i)})$$

and to proceed with an asymptotic analysis that extends the classical conditional MLE asymptotics. We
omit further discussion due to lack of space but include some experimental results for CRFs.

Figure 3 (left) depicts a similar experiment to the one described in the previous section for conditional
estimation in CRF models. The figure displays per-sequence perplexity as a function of $n$ (x axis) and $\lambda_1$ (y
axis). We observe a trend nearly identical to that of the Boltzmann chain MRF (Figure 2, left and middle).

8 A5: Tradeoff 

As the figures in the previous sections display, the estimation accuracy increases with the total number of
labels. The Cramer-Rao lower bound states that the highest accuracy is obtained by the maximum likelihood
operating on fully observed data. However, assuming that a certain cost is associated with labeling data, SSL
resolves a fundamental accuracy-cost tradeoff. A decrease in estimation accuracy is acceptable in return for
decreased labeling cost.

Our ability to mathematically characterize the dependency of the estimation accuracy on the labeling
cost leads to a new quantitative formulation of this tradeoff. Each labeling policy ($\lambda, n$ in classification and
$\rho$ in structured prediction) is associated with a particular estimation accuracy via Propositions 2 and 4
and with a particular labeling cost. The precise way to measure labeling cost depends on the situation at
hand, but we assume in this paper that the labeling cost is proportional to the numbers of labeled samples
(classification) and of labeled sequence elements (structured prediction). This assumption may be easily
relaxed by using other labeling cost functions, e.g., obtaining unlabeled data may incur some cost as well.

Figure 1: Three error measures for the multinomial naive Bayes SSL classifier applied to Reuters RCV1 text
data. In each, error is a function of $n$ (horizontal axis) and $\lambda$ (vertical axis). The left depicts classification
error rate, the middle depicts the trace of empirical mse, and right depicts the log-trace of the asymptotic
variance. Results were obtained using held-out sets and averaged using cross validation. Particularly
noteworthy is a striking correlation among all three figures, justifying the use of asymptotic variance as a
surrogate for classification error, even for relatively small values of $n$.

Figure 2: Test-set results for two policies of unlabeled data for Boltzmann chain MRFs applied to the CoNLL
2000 text-chunking dataset (left, middle). The shaded portion of the right panel depicts the empirically
unachievable region for naive Bayes SSL classifier on the 20-newsgroups dataset. The left two share a
common log-perplexity scale (vertical axis) while the vertical axis of the right panel corresponds to trace
of the empirical MSE; the horizontal axis indicates labeling cost. As above, results were obtained using
held-out sets and averaged using cross validation. Collectively these figures represent the application and
effect of various labeling policies. The left figure depicts the consequence of partially missing samples for
various $n, \lambda$ while the middle and right represent SSL in the more traditional all or nothing sense: either
labeled or unlabeled samples. See text for more details.

Figure 3: Left figure depicts sentence-wise log-perplexity for CRFs under the same policy and experimental
design of the above Boltzmann chain. Center figure represents log-trace of the theoretical variance and
demonstrates phenomena under a simplified scenario, i.e., a mixture of two 1000-dim multinomials with
unbalanced prior. Rightmost figure demonstrates the practical applicability of utilizing asymptotic analysis
to characterize parameter error as a function of size of training-set partition. The training-set is fixed at
2000 samples and split for training and validating. As the proportion used for training is increased, we see
a decrease in error. See text for more details.

Geometrically, each labeling policy may thus be represented in a two dimensional scatter plot where the
horizontal and vertical coordinates correspond to labeling cost and estimation error respectively. Three such
scatter plots appear in Figure 2 (see Section 7 for a description of the left and middle panels). The right
panel corresponds to multinomial naive Bayes SSL classifier and the 20-newsgroups classification dataset.
Each point in that panel corresponds to different $n, \lambda$.

The origin corresponds to the most desirable (albeit unachievable) position in the scatter plot representing
zero error at no labeling cost. The cloud of points obtained by varying $n, \lambda$ (classification) and $\rho$ (structured
prediction) represents the achievable region of the diagram. Most attractive is the lower and left boundary
of that region which represents labeling policies that dominate others in both accuracy and labeling cost.
The non-achievable region is below and to the left of that boundary (see shaded region in Figure 2, right).
The precise position of the optimal policy on the boundary of the achievable region depends on the relative
importance of minimizing estimation error and minimizing labeling cost. A policy that is optimal in one
context may not be optimal in a different context.

It is interesting to note that even in the case of naive Bayes classification (Figure 2, right) some labeling
policies (corresponding to specific choices of $n, \lambda$) are suboptimal. These policies correspond to points in the
interior of the achievable region. A similar conclusion holds for Boltzmann chain MRF. For example, some
of the points in Figure 2 (left) denoted by the label 700 are dominated by the more lightly shaded points.

We consider in particular three different ways to define an optimal labeling policy (i.e., determining how
much data to label) on the boundary of the achievable region

$$(\lambda^*, n^*)_1 = \arg\min_{(\lambda, n) : \lambda n \le C} \mathrm{tr}(\Sigma^{-1}) \tag{17}$$

$$(\lambda^*, n^*)_2 = \arg\min_{(\lambda, n) : \mathrm{tr}(\Sigma^{-1}) \le C} \lambda n \tag{18}$$

$$(\lambda^*, n^*)_3 = \arg\min_{(\lambda, n)} \lambda n + \alpha\, \mathrm{tr}(\Sigma^{-1}). \tag{19}$$

The first applies in situations where the labeling cost is bounded by a certain available budget. The second
applies when a certain estimation accuracy is acceptable and the goal is to minimize the labeling cost. The
third considers a more symmetric treatment of the estimation accuracy and labeling cost.

Equations (17)-(19) may be easily generalized to arbitrary labeling costs $f(n, \lambda)$. Equations (17)-(19)
may also be generalized to the case of structured prediction with $\rho$ replacing $(\lambda, n)$ and cost($\rho$) replacing
$\lambda n$.
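
The three criteria can be illustrated with a small grid search. The sketch below rests on assumptions that are not taken from the paper: a scalar parameter with hypothetical per-sample Fisher information 4.0 for labeled and 1.0 for unlabeled samples, so that $\Sigma(\lambda) = 4\lambda + (1 - \lambda)$, and an error measure $\Sigma(\lambda)^{-1}/n$, which is how $n$ is assumed to enter via (7). All function names are hypothetical.

```python
def asymptotic_error(lam, n, info_labeled=4.0, info_unlabeled=1.0):
    """Assumed error measure: Sigma(lam)^{-1} / n for a scalar parameter."""
    sigma = lam * info_labeled + (1.0 - lam) * info_unlabeled
    return 1.0 / (n * sigma)

def grid(n_values, lam_values):
    """All candidate policies (lam, n) on a finite grid."""
    return [(lam, n) for n in n_values for lam in lam_values]

def solve_budget(candidates, budget):
    """Criterion (17): minimize error subject to labeling cost lam * n <= budget."""
    feasible = [c for c in candidates if c[0] * c[1] <= budget]
    return min(feasible, key=lambda c: asymptotic_error(*c))

def solve_accuracy(candidates, max_error):
    """Criterion (18): minimize labeling cost subject to error <= max_error."""
    feasible = [c for c in candidates if asymptotic_error(*c) <= max_error]
    return min(feasible, key=lambda c: c[0] * c[1])

def solve_penalized(candidates, alpha):
    """Criterion (19): minimize labeling cost plus alpha times the error."""
    return min(candidates, key=lambda c: c[0] * c[1] + alpha * asymptotic_error(*c))

candidates = grid(range(100, 2001, 100), [k / 10 for k in range(1, 11)])
policy_budget = solve_budget(candidates, budget=200)
policy_accuracy = solve_accuracy(candidates, max_error=1e-3)
policy_penalized = solve_penalized(candidates, alpha=1e5)
```

Because unlabeled data is free in this cost model, the budgeted solution tends to use as many samples as the grid allows and spends the budget on a small labeled fraction; a cost function $f(n, \lambda)$ that also charges for unlabeled data would change this.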

9 A6: Practical Algorithms 

Choosing a policy $(\lambda, n)$ or $\rho$ resolves the SSL tradeoff of accuracy vs. cost. Such a resolution is tantamount to
answering the basic question of how many labels should be obtained (and in the case of structured prediction
also which ones). Resolving the tradeoff via (17)-(19) or in any other way, or even simply evaluating the
asymptotic accuracy $\mathrm{tr}(\Sigma)$, requires knowledge of the model parameter $\theta_0$ that is generally unknown in
practical settings.

We propose in this section a practical two stage algorithm for computing an estimate $\hat\theta_n$ within a particular
accuracy-cost tradeoff. Assuming we have $n$ unlabeled examples, the algorithm begins the first stage by
labeling $r$ samples. It then estimates $\hat\theta'$ by maximizing the likelihood over the $r$ labeled and $n - r$ unlabeled
samples. The estimate $\hat\theta'$ is then used to obtain a plug-in estimate for the asymptotic accuracy $\mathrm{tr}(\hat\Sigma)$. In
the second stage the algorithm uses the estimate $\mathrm{tr}(\hat\Sigma)$ to resolve the tradeoff via (17)-(19) and determine
how many more labels should be collected. Note that the labels obtained at the first stage may be used in
the second stage as well with no adverse effect.

The two-stage algorithm spends some initial labeling cost in order to obtain an estimate for the
quantitative tradeoff parameters. The final labeling cost, however, is determined in a principled way based on the
relative importance of accuracy and labeling cost via (17)-(19). The selection of the initial number of labels
$r$ is important and should be chosen carefully. In particular it should not exceed the total desirable labeling
cost.
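
The two stages can be sketched on a toy problem as follows. Several simplifications are assumed here and are not the paper's procedure: the parameter is the scalar class prior $\theta = P(Y=1)$ with known class-conditionals $X | Y=y \sim \mathrm{Bernoulli}(q_y)$; stage 1 estimates $\theta$ from the $r$ labeled samples only, instead of maximizing the likelihood over labeled and unlabeled samples; and stage 2 resolves a version of (18) with $n$ fixed, using $\hat\Sigma(\lambda)^{-1}/n$ as the error. The names `labels_oracle`, `information`, and `two_stage_label_count` are hypothetical.

```python
import random

def information(theta, lam, q0, q1):
    """Scalar Sigma(lam) = lam * I_joint + (1 - lam) * I_marginal for the toy model."""
    i_joint = 1.0 / (theta * (1.0 - theta))
    m = theta * q1 + (1.0 - theta) * q0            # marginal P(X = 1)
    i_marginal = (q1 - q0) ** 2 / (m * (1.0 - m))
    return lam * i_joint + (1.0 - lam) * i_marginal

def two_stage_label_count(labels_oracle, n, r, target_var, q0, q1):
    """Return (stage-1 estimate, total number of labels to collect).

    labels_oracle(i) is a hypothetical callable returning the (costly) label
    of sample i; it stands in for a human annotator.
    """
    # Stage 1: label r samples and form a plug-in estimate of theta.
    # Simplification: labeled-only MLE, clipped away from 0 and 1.
    first_labels = [labels_oracle(i) for i in range(r)]
    theta_hat = min(max(sum(first_labels) / r, 1e-3), 1.0 - 1e-3)

    # Stage 2: smallest label count whose plug-in variance meets the target.
    for num_labels in range(r, n + 1):
        plug_in_var = 1.0 / (n * information(theta_hat, num_labels / n, q0, q1))
        if plug_in_var <= target_var:
            return theta_hat, num_labels
    return theta_hat, n   # target not reachable: label everything

rng = random.Random(0)
simulated_labels = [1 if rng.random() < 0.3 else 0 for _ in range(2000)]
theta_hat, num_labels = two_stage_label_count(
    simulated_labels.__getitem__, n=2000, r=100, target_var=2.5e-4, q0=0.2, q1=0.7)
```

The labels collected in stage 1 count toward the final total, so the function never returns fewer than $r$ labels, mirroring the remark that $r$ should not exceed the total desirable labeling cost.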

We provide some experimental results on the performance of this algorithm in Figure 3 (right). It displays
box-plots for the differences between $\mathrm{tr}(\hat\Sigma)$ and $\mathrm{tr}(\Sigma)$ as a function of the initial labeling cost $r$ for naive
Bayes SSL classifier and 20-newsgroups data. The figure illustrates that the two stage algorithm provides a
very accurate estimation of $\mathrm{tr}(\Sigma)$ for $r > 1000$ which becomes almost perfect for $r > 1300$.